Abstract

In contrast to comparing faces via single exemplars, match- 
ing sets of face images increases robustness and discrimina- 
tion performance. Recent image set matching approaches 
typically measure similarities between subspaces or mani- 
folds, while representing faces in a rigid and holistic man- 
ner. Such representations are easily affected by variations 
in terms of alignment, illumination, pose and expression. 
While local feature based representations are considerably 
more robust to such variations, they have received little at- 
tention within the image set matching area. We propose 
a novel image set matching technique, comprised of three 
aspects: (i) robust descriptors of face regions based on 
local features, partly inspired by the hierarchy in the hu- 
man visual system, (ii) use of several subspace and ex- 
emplar metrics to compare corresponding face regions, 
(iii) jointly learning which regions are the most discriminative
while finding the optimal mixing weights for combining
metrics. Experiments on LFW, PIE and MOBIO face
datasets show that the proposed algorithm obtains consid- 
erably better performance than several recent state-of-the- 
art techniques, such as Local Principal Angle and the Ker- 
nel Affine Hull Method. 

1. Introduction 

A recent trend in image set matching considers image 
sets as linear subspaces, with the similarity between the sets 
derived from the similarity between the subspaces [5, 13, 
35, 36]. In almost all subspace based approaches, faces are 
represented in a rigid and holistic manner, where each face 
is represented by one feature vector that describes the entire 
face. Such a representation implicitly embeds rigid spatial 
constraints between face components [4] . 

While subspaces are thought to be capable of accommodating
the effects of various image variations¹, the
magnitude and compounding effect of variations (such as
illumination, pose and expression changes) might overwhelm
even the most sophisticated subspace modelling
technique. The relatively poor performance of linear models
in such challenging recognition tasks appears to have
roots in the non-linear nature of typical image manifolds
[19, 33, 35], with much effort directed towards handling the
non-linearities (e.g. via kernel extensions [13, 35] and data
clustering [9, 11, 33]).

¹ For example, a linear subspace can be used for photometric invariance,
under the conditions of no shadowing and Lambertian reflectance [1].

In contrast to rigid face representations, a face can also 
be represented by a set of local features. This set can then 
be processed by a classifier that explicitly allows relaxed 
spatial constraints between face parts. Such a combination 
allows for some movement and/or deformations of the face 
components [4, 16, 27], which in turn leads to a degree of 
inherent robustness to expression and pose changes [16, 27] 
as well as misalignment [4]. Examples of such systems 
include Elastic Graph Matching [34], pseudo-2D hidden 
Markov models [4], and "bag of words" approaches [28]. 

Several studies in the domain of single-image to single- 
image matching have shown that non-linear structures can 
be effectively avoided by local representations. More pre- 
cisely, while the structures that describe holistic features 
tend to be non-linear and complex, linear structures are 
good/sufficient tools to approximate local features [18, 24]. 
As such, rather than using holistic face representations and 
relying on a model to handle the resulting non-linear varia- 
tions, it might be more appropriate to develop an image set 
matching technique based on local representations, while 
allowing relaxed spatial constraints. 

We propose an approach for image set face verifica- 
tion that uses a multitude of local representations and dis- 
tance metrics, and employs a learning algorithm to deter- 
mine which subset of descriptors and their associated met- 
rics is the most useful for discrimination. More specifically, 
from each image two types of robust local descriptors are 
obtained: region descriptors and compound region descrip- 
tors. The compound descriptors are inspired by the hierar- 
chical architecture of the human visual system, where the 
receptive fields of neurons tend to get larger in order to 
deal with increasingly complex stimuli [29]. The descrip- 
tors from corresponding regions in two face sets are pooled 
and then compared via several distance metrics (instead of 
relying on only one), resulting in a high-dimensional simi- 
larity vector. As such, the image set verification problem is 
converted to a binary problem on similarity features. 
Figure 1. Partly inspired by the human visual system [29], the
proposed approach has a hierarchical structure. Level 0 is the
image plane. Level 1 contains descriptors for regions within the face,
with the regions having arbitrary sizes and locations. Each
descriptor is a probabilistic histogram, obtained using a dictionary of
visual words. Level 2 contains compound descriptors, generated
by aggregating the descriptors from Level 1. The descriptors from
Levels 1 and 2 are fed to a learning mechanism which determines
which subset of descriptors is the most useful for face verification.



By learning to separate similarity vectors representing
matched sets (i.e. sets of the same person) and mismatched
sets (i.e. sets of two different persons), we are in effect jointly
determining which regions are the most discriminative while
finding the optimal mixing weights for combining metrics.
Fig. 1 shows a conceptual overview of the approach.

We continue the paper as follows. The feature extraction 
process is described in Section 2. The details of the learning 
approach are given in Section 3. Comparative evaluations 
of the proposed method against other image set matching 
techniques are given in Section 4. The main findings and 
possible future directions are covered in Section 5. 

2. Hierarchical Feature Extraction 

As shown in Fig. 1, the feature extraction is hierarchical 
in nature, with 3 levels. The lowest level (level 0), is the 
image plane. The details for the feature extraction at levels 1 
and 2 are given in Sections 2.1 and 2.2, respectively. 

2.1. Level 1 

Each descriptor in level 1 corresponds to a relatively
large region in the image plane. The descriptor for a region
of size p x p, at an arbitrary location, is constructed as
follows. In a similar manner to [28], the region is split into
small overlapping blocks, with each block having a size of
8 x 8. For each block a histogram of probabilities is
calculated, where each entry in the histogram reflects the
similarity of the block to a pre-defined 'visual word'. Each region
is represented as the average of all the histograms obtained
for the region's blocks. The procedure is elucidated below.

Each block is represented by a low-dimensional texture
descriptor. For each texture descriptor $x_{r,i}$ obtained from a
block in region r, a probabilistic histogram is computed:

$$ h_{r,i} = \left[ \frac{w_1 \, p_1(x_{r,i})}{\sum_{g=1}^{N_g} w_g \, p_g(x_{r,i})}, \; \ldots, \; \frac{w_{N_g} \, p_{N_g}(x_{r,i})}{\sum_{g=1}^{N_g} w_g \, p_g(x_{r,i})} \right]^T $$ (1)

where the g-th element in $h_{r,i}$ is the posterior probability
of $x_{r,i}$ according to the g-th component of a visual
dictionary model, and $p_g(\cdot)$ denotes the density of Gaussian g.
The visual dictionary model employed here
is a convex mixture of Gaussians [3], parameterised by
$\lambda = \{ w_g, \mu_g, C_g \}_{g=1}^{N_g}$, where $N_g$ is the number of
Gaussians, while $w_g$, $\mu_g$ and $C_g$ are the weight, mean vector
and covariance matrix for Gaussian g, respectively. The
mean of each Gaussian can be thought of as a particular
'visual word'. The visual dictionary is obtained by pooling
a large number of texture descriptors from training images,
followed by employing the Expectation Maximisation
algorithm [3] to find the dictionary's parameters (i.e., $\lambda$).
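
As a rough illustration of the probabilistic histogram, the sketch below computes the posterior probability of each Gaussian for one texture descriptor. It is a minimal example rather than the implementation used in the experiments: diagonal covariance matrices are assumed for simplicity, and the function names, the toy dictionary of 4 Gaussians and the random parameters are all hypothetical.

```python
import numpy as np

def log_gaussian_diag(x, means, variances):
    """Log-density of x under each Gaussian (diagonal covariances assumed)."""
    d = x.shape[0]
    diff = x - means                                   # shape (G, d)
    return -0.5 * (d * np.log(2.0 * np.pi)
                   + np.sum(np.log(variances), axis=1)
                   + np.sum(diff ** 2 / variances, axis=1))

def posterior_histogram(x, weights, means, variances):
    """Posterior probability of each Gaussian ('visual word') given descriptor x."""
    log_joint = np.log(weights) + log_gaussian_diag(x, means, variances)
    log_joint -= log_joint.max()                       # stabilise before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Toy dictionary: 4 Gaussians over 15-dimensional descriptors (values are illustrative).
rng = np.random.default_rng(0)
G, d = 4, 15
weights = np.full(G, 1.0 / G)
means = rng.normal(size=(G, d))
variances = np.ones((G, d))

# A descriptor equal to the mean of Gaussian 2 is assigned mostly to that visual word.
h = posterior_histogram(means[2], weights, means, variances)
```

Working in the log domain avoids underflow when the descriptor is far from every Gaussian, which can matter once the dictionary has many components.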

In this work we use local texture descriptors based on
DCT analysis with illumination normalisation [28]. However,
it is possible to use other texture descriptors, e.g. based
on Gabor wavelets [20] or Local Binary Patterns [2].

Once the histograms are computed for each feature vector
from region r, an average histogram for the region is
built: $h_{r,avg} = \frac{1}{N} \sum_{i=1}^{N} h_{r,i}$, where N is the number
of feature vectors (blocks) in the region. Due to the averaging operation,
in each region there is a loss of spatial relations between
face parts. As such, each region is in effect described by an
orderless collection of local features ('bag-of-words'). The
loss of spatial relations allows for a degree of misalignment,
pose variations and expression changes [4, 16, 27, 28].
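
The averaging step can be sketched as follows. This is only an illustration under stated assumptions: the block step of 2 pixels is not given in the text, and `toy_histogram` is a hypothetical stand-in for the DCT-based descriptor followed by the dictionary posterior.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def region_blocks(region, block=8, step=2):
    """All overlapping block x block patches of a square region (step is assumed)."""
    windows = sliding_window_view(region, (block, block))
    windows = windows[::step, ::step]                  # subsample the block positions
    return windows.reshape(-1, block, block)

def region_descriptor(region, block_histogram, block=8, step=2):
    """Average of the per-block histograms; spatial order of blocks is discarded."""
    hists = np.stack([block_histogram(b) for b in region_blocks(region, block, step)])
    return hists.mean(axis=0)

def toy_histogram(block_pixels, bins=4):
    """Hypothetical stand-in for a per-block probabilistic histogram."""
    counts, _ = np.histogram(block_pixels, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

region = np.random.default_rng(0).random((24, 24))    # one 24 x 24 region
blocks = region_blocks(region)
h_avg = region_descriptor(region, toy_histogram)
```

Because the mean is taken over all blocks, shuffling the blocks inside the region leaves the descriptor unchanged, which is the 'bag-of-words' property described above.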

2.2. Level 2 

In the human visual system, the receptive fields of neu- 
rons tend to get larger in order to deal with increasingly 
complex stimuli [29]. The responses of complex cells can 
be pooled from the responses of adjacent simple cells using 
'max' or 'sum' operations [25, 29]. In a similar manner, we 
use three configurations for combining the descriptors from 
level 1, using the 'sum' operation. 

The three configurations are shown in Fig. 2. The first 
configuration is in effect a horizontal shape. The compound
descriptor in this case is a summation of three
regions, i.e. simple cells, where the centers of the two outer
regions are located at $(-(p-d), 0)$ and $(p-d, 0)$ relative
to the center of the middle region, where p x p is the
region size. The second configuration is similar to the first,
except a vertical shape is used. We conjecture the first con- 
figuration can be useful for capturing horizontally elongated 
structures such as the mouth, eyes and eyebrows, while the 
second configuration can be useful for representing verti- 
cally elongated shapes, such as the nose. 
Figure 2. Compound descriptors are generated by computing the
sum over the descriptors from level 1. We use three configurations:
(a) for representing horizontally elongated shapes like the eyes and
mouth; (b) for representing vertically elongated shapes such as the
nose; (c) a mixture of (a) and (b), for capturing a degree of
correlations between shapes such as the nose and mouth.

The third configuration is a combination of the previous 
two shapes and forms a cross shape. We believe it can be 
useful for capturing a degree of correlations between the 
appearance of various face parts. For example, the shape of 
the nose might be related to the shape of the mouth. 
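
The three configurations can be sketched as sums over a grid of level 1 descriptors. In this hypothetical example, `level1[row, col]` holds the descriptor of the region at grid position (row, col), and `spacing` is the distance between the centers of neighbouring simple cells; the data layout and names are assumptions made for illustration.

```python
import numpy as np

# Offsets (in units of `spacing`) of the simple cells in each configuration.
OFFSETS = {
    "horizontal": [(0, -1), (0, 0), (0, 1)],
    "vertical": [(-1, 0), (0, 0), (1, 0)],
    "cross": [(0, -1), (0, 0), (0, 1), (-1, 0), (1, 0)],
}

def compound_descriptor(level1, row, col, spacing, shape):
    """Sum the level 1 descriptors of the simple cells forming one compound region."""
    rows, cols = level1.shape[:2]
    total = np.zeros(level1.shape[-1])
    for dr, dc in OFFSETS[shape]:
        r, c = row + dr * spacing, col + dc * spacing
        if not (0 <= r < rows and 0 <= c < cols):
            raise IndexError("compound region extends outside the image")
        total += level1[r, c]
    return total

# Toy grid of 41 x 41 region positions, each with a 3-bin descriptor of ones.
level1 = np.ones((41, 41, 3))
h_horizontal = compound_descriptor(level1, 20, 20, 4, "horizontal")
h_cross = compound_descriptor(level1, 20, 20, 4, "cross")
```

The explicit bounds check matters because negative indices would otherwise silently wrap around to the opposite side of the grid.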

3. Determining Salient Descriptors 

An image set face verification system needs to determine
whether two sets, A and B, represent the same person. In
general this is accomplished by comparing the similarity
between the two sets to a predefined threshold $\tau$.

We assume that the image set A is comprised of I images.
Each image i is represented by v descriptors (histograms
from level 1 and 2), $h_1^{[i]}, h_2^{[i]}, \ldots, h_v^{[i]}$, with each descriptor
covering a particular region. We define a local mode as a
matrix which contains all descriptors for region j from the
I images:

$$ L_j^A = \left[ h_j^{[1]}, h_j^{[2]}, \ldots, h_j^{[I]} \right] $$ (2)

To compare two corresponding local modes from sets A
and B, i.e. $L_j^A$ and $L_j^B$, instead of relying on only one
similarity measure, we propose to use k similarity measures,
$d_1(L_j^A, L_j^B), \ldots, d_k(L_j^A, L_j^B)$. We define the
overall similarity vector between sets A and B as containing
k similarity measures for each local mode, resulting in
a kv-dimensional vector:

$$ \left[ d_1(L_1^A, L_1^B), \ldots, d_k(L_1^A, L_1^B), \; \ldots, \; d_1(L_v^A, L_v^B), \ldots, d_k(L_v^A, L_v^B) \right]^T $$ (3)

The image set verification problem is hence converted to a
binary classification problem involving similarity vectors.
Figure 3 provides a graphical interpretation.
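
A minimal sketch of assembling the similarity vector is given below. Each local mode is assumed to be stored as a matrix with one descriptor per column, and the two measures used here are simple hypothetical placeholders chosen only to keep the example short.

```python
import numpy as np

def mean_distance(la, lb):
    """Placeholder measure: distance between the mean descriptors of two local modes."""
    return float(np.linalg.norm(la.mean(axis=1) - lb.mean(axis=1)))

def min_pair_distance(la, lb):
    """Placeholder measure: smallest distance between any pair of descriptors."""
    diffs = la[:, :, None] - lb[:, None, :]            # shape (G, I_a, I_b)
    return float(np.linalg.norm(diffs, axis=0).min())

def similarity_vector(modes_a, modes_b, measures):
    """Concatenate k measures for each of the v corresponding local modes."""
    return np.array([d(la, lb)
                     for la, lb in zip(modes_a, modes_b)
                     for d in measures])

rng = np.random.default_rng(0)
v, G, I = 5, 6, 3                                      # modes, histogram bins, images per set
modes_a = [rng.random((G, I)) for _ in range(v)]
modes_b = [rng.random((G, I)) for _ in range(v)]
measures = [mean_distance, min_pair_distance]          # k = 2

s_ab = similarity_vector(modes_a, modes_b, measures)   # length k * v = 10
s_aa = similarity_vector(modes_a, modes_a, measures)   # a set compared with itself
```

Keeping the k entries of each local mode adjacent means a selected feature index can be mapped back to a (region, measure) pair with integer division and remainder.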

We use two families of similarity measures: subspace
based, and exemplar based. For the subspace based measures,
we employ the Grassmannian geodesic distance (arc-length)
and Binet-Cauchy distance [12]. For the exemplar
based measures, we use Hausdorff and Modified Hausdorff
distances [7]. The two families are elucidated below.

For the subspace based measures, each local mode $L_j^A$
is modelled by a linear subspace. A common similarity
measure between subspaces is the concept of principal
angles [36]. If $O_1 \in \mathbb{R}^{d \times n_1}$ and $O_2 \in \mathbb{R}^{d \times n_2}$
are two linear subspaces in $\mathbb{R}^d$ with minimum rank
$r = \min(\mathrm{rank}(O_1), \mathrm{rank}(O_2))$, then there are exactly r uniquely
defined principal angles between $O_1$ and $O_2$:

$$ \cos(\theta_i) = \max_{x_i \in O_1, \; y_i \in O_2} x_i^T y_i $$ (4)

subject to $x_i^T x_i = y_i^T y_i = 1$, $x_i^T x_j = y_i^T y_j = 0$, $i \neq j$. A
straightforward method for computing the principal angles
is based on Singular Value Decomposition. More specifically,
the cosines of the principal angles are the singular
values of $O_1^T O_2$:

$$ O_1^T O_2 = U \Lambda V^T $$ (5)

where the singular values are the diagonal entries of $\Lambda$.

Based on the above principal angles, we use two similarity
measures: Grassmannian geodesic distance and Binet-Cauchy
distance, defined respectively as [12]:

$$ d_G(O_1, O_2) = \left( \sum_{i=1}^{r} \theta_i^2 \right)^{1/2} $$ (6)

$$ d_{BC}(O_1, O_2) = \left( 1 - \prod_{i=1}^{r} \cos^2(\theta_i) \right)^{1/2} $$ (7)

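
The subspace based measures can be sketched with NumPy as follows. The example assumes the usual forms of the two distances, namely the arc-length $(\sum_i \theta_i^2)^{1/2}$ and the Binet-Cauchy distance $(1 - \prod_i \cos^2 \theta_i)^{1/2}$; the function names and the choice of subspace dimension are illustrative.

```python
import numpy as np

def orthonormal_basis(local_mode, dim):
    """Leading `dim` left singular vectors of a local mode (descriptors as columns)."""
    u, _, _ = np.linalg.svd(local_mode, full_matrices=False)
    return u[:, :dim]

def principal_angles(o1, o2):
    """Principal angles between subspaces spanned by orthonormal bases o1 and o2."""
    cosines = np.linalg.svd(o1.T @ o2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))      # clip guards against rounding

def geodesic_distance(theta):
    """Grassmannian geodesic (arc-length) distance."""
    return float(np.sqrt(np.sum(theta ** 2)))

def binet_cauchy_distance(theta):
    """Binet-Cauchy distance."""
    return float(np.sqrt(max(0.0, 1.0 - np.prod(np.cos(theta) ** 2))))

# Two planes in R^3 sharing one axis: the principal angles are 0 and pi/2.
o1 = np.eye(3)[:, [0, 1]]
o2 = np.eye(3)[:, [0, 2]]
theta = principal_angles(o1, o2)
d_geo = geodesic_distance(theta)
d_bc = binet_cauchy_distance(theta)
```

For real data, `orthonormal_basis` would be applied to each local mode first, since the singular values of $O_1^T O_2$ equal the cosines only when both bases are orthonormal.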
For the exemplar based measures, local modes are
compared using Hausdorff and Modified Hausdorff
distances [7]. Given two corresponding local modes $L_j^A$ and
$L_j^B$, the Hausdorff distance (HD) is defined as:

$$ d_{HD}(L_j^A, L_j^B) = \max \left( \max_{a \in A} \min_{b \in B} \| a - b \|, \; \max_{b \in B} \min_{a \in A} \| a - b \| \right) $$ (8)

Intuitively, if the Hausdorff distance is d, then every
point of A must be within a distance d of some point of B and
vice versa. For image processing applications, Dubuisson
et al. [7] proposed the modified Hausdorff distance (MHD),
which is more robust against outliers:

$$ d_{MHD}(L_j^A, L_j^B) = \max \left( d_M(L_j^A, L_j^B), \; d_M(L_j^B, L_j^A) \right) $$ (9)

where $d_M(L_j^A, L_j^B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \| a - b \|$, with |A|
denoting the cardinality of set A.
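
Both exemplar based distances follow directly from the matrix of pairwise distances. In the sketch below, each set is assumed to be stored with one descriptor per row, and Euclidean distance is used for the norm.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def hausdorff(a, b):
    """Largest nearest-neighbour distance, taken in both directions."""
    d = pairwise_distances(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def modified_hausdorff(a, b):
    """Mean nearest-neighbour distance in each direction, then the larger of the two."""
    d = pairwise_distances(a, b)
    return float(max(d.min(axis=1).mean(), d.min(axis=0).mean()))

# The point (4, 0) acts as an outlier in B.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [4.0, 0.0]])
hd = hausdorff(a, b)             # 3.0, driven entirely by the outlier
mhd = modified_hausdorff(a, b)   # 1.5, the outlier is averaged with the exact match
```

The toy example shows why the modified variant is less sensitive to outliers: a single distant descriptor sets the Hausdorff distance, but only contributes one term to the mean.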

Due to the dense nature of the feature extraction process, 
a hefty and redundant representation is available for any im- 
age, leading to a very high dimensional similarity vector. 
A further contributing factor to the high dimensionality is 
the use of four distance metrics per local mode. As such, in- 
stead of blindly feeding the similarity vectors to a standard 
learning mechanism such as a Support Vector Machine [3], 
we have elected to use an adapted version of the AdaBoost 
algorithm [31], which is more suitable for dealing with such 
high dimensional problems. 




Figure 3. Converting the image set verification problem to a bi- 
nary problem on similarity features. Each region in a single im- 
age for a given person is described by an average histogram of 
visual words. The corresponding histograms for a particular re- 
gion across several images form a local mode. The corresponding 
local modes from two people are compared using several distance 
metrics. All the resulting distances for all modes are placed into a 
similarity vector. 

In the adapted AdaBoost, each weak learner operates on a
single feature. As a result, after Q rounds of boosting,
Q features are selected. The adapted version hence
has a considerably lower computational complexity than the
original version [10]: in a D-dimensional problem, Q
comparisons are required instead of Q x D.
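
The idea of boosting with single-feature weak learners can be sketched with threshold stumps. This is a generic discrete AdaBoost written for illustration, not the exact adapted algorithm of [31]: the exhaustive threshold search, the label convention (+1 for matched, -1 for mismatched) and the toy data are all assumptions, and a feature may be selected in more than one round.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Predict +1 when pol * (feature j - thr) <= 0, otherwise -1."""
    return np.where(pol * (X[:, j] - thr) <= 0, 1, -1)

def best_stump(X, y, w):
    """Single-feature stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)                         # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, rounds):
    """Each round selects one similarity feature and assigns it a mixing weight."""
    w = np.full(X.shape[0], 1.0 / X.shape[0])
    model = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1.0 - 1e-10)         # avoid division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, pol))
        w /= w.sum()
        model.append((j, thr, pol, alpha))
    return model

def predict(model, X):
    score = sum(alpha * stump_predict(X, j, thr, pol) for j, thr, pol, alpha in model)
    return np.where(score >= 0, 1, -1)

# Toy similarity vectors: only feature 1 separates matched from mismatched pairs.
X = np.array([[0.9, 0.1, 0.5], [0.2, 0.2, 0.4], [0.8, 0.3, 0.6],
              [0.1, 0.7, 0.5], [0.7, 0.8, 0.4], [0.3, 0.9, 0.6]])
y = np.array([1, 1, 1, -1, -1, -1])
model = adaboost(X, y, rounds=2)
```

At test time only the features stored in `model` need to be computed, so the remaining entries of the similarity vector can be skipped entirely.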

4. Experiments 

In this section we first provide an overview of the im- 
age datasets used in the experiments (Section 4.1), fol- 
lowed by a comparative performance evaluation against 
several benchmark and recent state-of-the-art methods (Sec- 
tion 4.2). 

4.1. Image Datasets 

We employed 3 datasets: Labeled Faces in the Wild 
(LFW) [17], CMU PIE [30] and MOBIO [22]. The datasets 
contain various face orientations, expressions, illumination 
situations and occlusions. A verification setup similar to 
the LFW protocol [17] is used, where the task is to classify 
a pair of previously unseen image sets as either belonging 
to the same person (matched pair) or two different persons 
(mismatched pair). In all experiments the images are split 
into three groups: (i) training, (ii) development, (iii) eval- 
uation. The training group was used purely for construct- 
ing the visual dictionary — its subjects were never seen in 
the development and evaluation groups. Experiments on all
datasets were carried out on face images which are closely
cropped and downsampled to a size of 64 x 64. Each image
set contains three images. The number of matched pairs and
mismatched pairs is the same (balanced), in order to prevent
a bias towards one of the pair types.

For the LFW dataset, 620 pairs of image sets were generated,
with 310 pairs for the development group and 310 pairs
for the evaluation group. The generic subset from LFW view 1
was used for the training group.

For the CMU PIE dataset, we used the near frontal poses 
(C05, C07, C09, C27 and C29), resulting in 170 images 
per subject with various illuminations and expressions. We 
randomly selected 8 subjects for the training group while 
development and evaluation groups each have 30 subjects. 
1,200 pairs of images were generated, with the development 
and evaluation groups having 600 pairs each. 

The MOBIO dataset contains images captured from mo- 
bile devices. The quality of the images is generally poor 
with blurring from motion and smudged lenses, as well as 
changes in illumination between scenes. A Haar-based cas- 
cade classifier [32] was used to locate faces in each frame. 
The eyes within each face are located using a similar cas- 
cade classifier. If no eyes are located, their approximate 
location is inferred from the size of the face bounding box. 
The faces are then resized and cropped such that the eyes are 
centered with a 32-pixel inter-eye distance. We used the de- 
velopment subset of MOBIO, which contains 1,500 probe 
videos from 20 females and 27 males. We generated 832 
pairs of images for the development group and 800 pairs for 
the evaluation group. The background data subset was used 
as the training group. 

4.2. Comparative Performance Evaluation 

The proposed approach is compared against several 
benchmark methods as well as recent state-of-the-art meth- 
ods. The evaluated methods are representative techniques 
for exemplar-based and subspace-based approaches. 

The exemplar-based techniques are: Laplacianface [15], 
Local Binary Pattern (LBP) [2], Multi-Region Histograms 
(MRH) [28], and Local Facial Features (LFF) [6]. The 
subspace-based techniques are: Mutual Subspace Method 
(MSM) [36], Kernel Affine Hull Method (KAHM) [5], and 
Local Principal Angle (Local-PA) [21]. 

We note that the above approaches can also be classified 
as either local or holistic in terms of the underlying feature 
extraction. LBP, MRH, LFF, Local-PA and the proposed ap- 
proach are in the local based category, while Laplacianface, 
MSM and KAHM are in the holistic based category. 

Similarity judgements in exemplar-based methods
were carried out using the Modified Hausdorff Distance
(MHD) [7]. The KAHM approach used a linear kernel
with the parameters tuned according to the recommendations
made in [5]. The best results are reported. For
LBP, uniform histograms with (8, 1) neighbourhoods are
employed. The LBP block size was selected empirically
as 7 x 9. In Laplacianface, the subspace dimensions were
set by retaining enough leading eigenvectors to account for
98% of the overall energy in the eigen-decomposition. In
Local-PA, the block size was 16 x 16, also obtained
empirically.

Based on preliminary experiments, the proposed ap- 
proach used the following parameters: the size of each re- 
gion is 24 x 24, dimension of each DCT-based texture de- 
scriptor is 15, and the number of visual words in the dictio- 
nary is 1024. 

To generate compound descriptors, the distance between 
centers of simple cells (regions in the image plane) was se- 
lected as 4, 8 and 12. For images of size 64 x 64, this results 
in 1681 direct regions and 8153 compound regions. As four 
distance metrics are used for each local mode, the dimen- 
sionality of the resulting similarity vector for each image 
set pair is 39336. The discrimination performance appears 
to stabilise with a subset of 150 similarity features, as se- 
lected by the AdaBoost algorithm. 
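
The reported region counts can be reproduced with a short calculation. The sketch assumes that regions are placed at every pixel offset and that a compound region is counted only when all of its constituent regions lie inside the image; these placement rules are not stated explicitly in the text, but they yield the same numbers.

```python
def count_regions(image=64, region=24, spacings=(4, 8, 12)):
    """Count direct and compound regions under the assumed placement rules."""
    per_axis = image - region + 1                      # 41 region positions per axis
    direct = per_axis ** 2
    compound = 0
    for s in spacings:
        valid = per_axis - 2 * s                       # centre positions with both neighbours inside
        horizontal = valid * per_axis
        vertical = per_axis * valid
        cross = valid * valid
        compound += horizontal + vertical + cross
    return direct, compound

direct, compound = count_regions()
dimension = 4 * (direct + compound)                    # four distance metrics per local mode
print(direct, compound, dimension)                     # prints: 1681 8153 39336
```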

An example of cumulative weights of the most discriminant
local modes obtained by the boosting algorithm is
shown in Fig. 4. The cumulative weight for a pixel I(x,y) is
defined as the sum of the weights of the selected regions
that include the pixel. Most of the regions are selected from the
inner part of the face, with stress on the regions around the
mouth, nose and eyes.
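
The cumulative weight map can be sketched as a sum of weighted rectangles. The tuple layout (top, left, size, weight) is a hypothetical convention for this example; a compound region would contribute one rectangle per constituent simple cell.

```python
import numpy as np

def cumulative_weight_map(selected, image_size=64):
    """Sum the weights of all selected regions covering each pixel."""
    heat = np.zeros((image_size, image_size))
    for top, left, size, weight in selected:
        heat[top:top + size, left:left + size] += weight
    return heat

# Two overlapping 24 x 24 regions with illustrative weights.
selected = [(0, 0, 24, 0.5), (12, 12, 24, 0.25)]
heat = cumulative_weight_map(selected)
```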

The comparative results are shown in Table 1, with the 
verification accuracy defined as the average of the accuracy 
on matched and mismatched pairs. The relatively poor per- 
formance of the Laplacianface approach implies the diffi- 
culty of the recognition task, considering that the method 
is expected to perform relatively well if the imaging condi- 
tions do not differ greatly between training and test datasets. 
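
The accuracy measure can be written out as follows. The label convention (+1 for matched pairs, -1 for mismatched pairs) is an assumption made for this sketch.

```python
import numpy as np

def verification_accuracy(y_true, y_pred):
    """Average of the accuracy on matched pairs and the accuracy on mismatched pairs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    matched = y_true == 1
    acc_matched = np.mean(y_pred[matched] == 1)
    acc_mismatched = np.mean(y_pred[~matched] == -1)
    return 0.5 * (acc_matched + acc_mismatched)

# 3 of 4 matched pairs and 1 of 2 mismatched pairs are classified correctly.
acc = verification_accuracy([1, 1, 1, 1, -1, -1], [1, 1, 1, -1, -1, 1])
```

With balanced pair counts, as used in the experiments, this value coincides with the plain fraction of correctly classified pairs.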

The results show that in all experiments local approaches
prevail over holistic techniques. This confirms the premise
of this work: relaxed local representations are more robust
than rigid holistic representations. Among exemplar-based
methods, MRH and LFF outperform Laplacianface and
LBP in terms of overall accuracy. Among the subspace approaches,
Local-PA outperforms MSM. We note that KAHM is marginally
superior to MSM (with the exception of CMU-PIE); however,
LBP+KAHM significantly outperforms MSM for all
experiments. This is consistent with the results reported in [5].

Figure 4. An example of the cumulative weights for face regions
selected by the boosting algorithm: (a) cropped face from PIE;
(b) brighter regions correspond to higher cumulative weights.

Table 1. Average verification accuracy (%) on LFW, PIE and MOBIO
datasets. The methods are grouped into two categories: (a) exemplar
based, and (b) subspace based. The proposed method uses
both exemplar and subspace based similarity metrics.

     Method                     LFW      PIE      MOBIO    overall

(a)  Laplacian [15] + MHD       65.48    69.17     85.50    73.38
     LBP [2] + MHD              79.35    78.17     94.75    84.09
     MRH [28] + MHD             86.45    75.50     96.75    86.23
     LFF [6]                    88.06    78.17     97.75    87.99

(b)  MSM [36]                   65.48    71.33     90.13    75.65
     Local-PA [21]              67.10    77.17     92.50    78.92
     KAHM [5]                   66.13    67.83     90.38    74.78
     LBP + KAHM [5]             73.22    76.00     95.38    81.53

     Proposed method            95.80    91.00    100.00    95.60

The proposed approach surpasses all other methods by 
a considerable margin on the LFW and PIE datasets. On 
LFW, the performance difference to LFF, the nearest com- 
peting approach, is 7.7 percentage points. On PIE, the im- 
provement over the nearest method is close to 13 percentage 
points. 

5. Main Findings and Future Directions 

We have proposed a novel image set matching technique 
for face verification, comprised of three aspects: (i) robust 
descriptors of face regions based on local features, partly in- 
spired by the hierarchy in the human visual system, (ii) use 
of several subspace and exemplar metrics to compare cor- 
responding face regions, (iii) jointly learning which regions 
are the most discriminative while finding the optimal mix- 
ing weights for combining metrics. Experiments on LFW, 
PIE and MOBIO face datasets show that the proposed algo- 
rithm obtains considerably better performance than several 
recent state-of-the-art techniques, such as Local Principal 
Angle and the Kernel Affine Hull Method. 

We note that the region descriptors used in Section 2 
somewhat resemble Sparse Representation (SR) and dic- 
tionary learning, as they are obtained through an over- 
complete visual dictionary [8]. While SR methods usually 
utilise greedy algorithms like Matching Pursuit or convex 
optimisation [8] (which are computationally expensive), the 
descriptors here are obtained through closed-form equa- 
tions. This is useful in large-scale data processing appli- 
cations. 

While the learning method presented here is specific to
a verification system (i.e. binary classification), extension to
arbitrary M-class discrimination problems is possible. An
M-class problem can be converted into a binary problem
via the use of intra- and inter-personal spaces [23]. More
specifically, instead of characterising class clusters, it is
possible to characterise what kind of image variation is typical
for the same person and what is typical for different persons.
Theoretically this is achieved by training a binary classifier
on the differences between two samples, i.e. $\Delta = S_1 - S_2$.
Based on learning the differences, two samples (here two
sets obtained from a specific local region) are considered as
representing the same person if they are classified as
intra-personal variation. Conversely, two samples represent two
unique individuals if their difference is classified as
extra-personal variation.